
perf: CS write buffering #749

Open

dmga44 wants to merge 2 commits into dev from perf-cs-write-buffering

Conversation


@dmga44 dmga44 commented Feb 16, 2026

This commit introduces the write buffering feature to the chunkserver.
With it, the chunkserver fast-replies to write data requests before the
data is actually on the drives, as would normally be expected. Before a
write data request is replied to:

  • the sequence of packets sent from the client must be the usual one,
    i.e. WRITE_INIT must be sent first.
  • the open job issued at the beginning of the write operation must
    have been processed.
  • all the data carried by the write data request must have been read
    from the network.
  • there must be available write buffering blocks; this counter is
    sized by the new WRITE_BUFFERING_SIZE_MB option and is maintained
    when clearing and destroying InputBuffer instances.
  • the chunk must be locked: the master must have the
    USE_CHUNKSERVER_SIDE_CHUNK_LOCK option enabled. This constraint is
    especially important for handling write errors that may occur on
    chunks already reported as written successfully.

Standard chunks are not eligible for these fast replies.
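The gating described above can be sketched as a single predicate. This is an illustrative sketch, not the PR's actual code: the names `canInstantReply`, `WriteState`, and `gAvailableWriteBufferingBlocks` are invented stand-ins for the real symbols.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical global: blocks still available for buffering, derived from
// WRITE_BUFFERING_SIZE_MB (names are illustrative, not the actual symbols).
std::atomic<int64_t> gAvailableWriteBufferingBlocks{1024};

struct WriteState {
	bool writeInitReceived;  // WRITE_INIT arrived before any data packet
	bool openJobProcessed;   // the chunk-open job finished
	bool allDataRead;        // full payload already read from the network
	bool chunkLocked;        // master granted a chunkserver-side lock
	bool isStandardChunk;    // standard chunks are excluded from fast replies
};

// Sketch of the gating logic: a write data request may be fast-replied
// only when every precondition holds and buffer capacity remains.
bool canInstantReply(const WriteState &s, int64_t blocksNeeded) {
	if (s.isStandardChunk) { return false; }
	if (!s.writeInitReceived || !s.openJobProcessed || !s.allDataRead ||
	    !s.chunkLocked) {
		return false;
	}
	// Reserve capacity atomically; roll back if we overshot the budget.
	int64_t left = gAvailableWriteBufferingBlocks.fetch_sub(blocksNeeded);
	if (left < blocksNeeded) {
		gAvailableWriteBufferingBlocks.fetch_add(blocksNeeded);
		return false;
	}
	return true;
}
```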

Because of the fast replies, a WriteHighLevelOp may receive its
WRITE_END packet before all the data it should write has actually
reached the disk. In that case, the instance becomes sealed:

  • no more network IO happens for that write high level operation.
  • it waits until the buffered data is actually synced before
    disappearing. The writeFinishedCallback handles the subsequent
    writes of the already pending input buffers; see also
    continueWritingIfPossible.

This is why the ChunkserverEntry now holds a list of write high level
operations instead of a single one: all writeHLO instances except the
last one must be sealed.
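A minimal sketch of that invariant, with invented types standing in for WriteHighLevelOp and ChunkserverEntry: starting a new write seals the previous tail, and a per-loop sweep (an analogue of clearCompletedWriteHLO) drops ops whose buffers have been synced.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Illustrative sketch of the list-of-write-ops invariant: only the
// newest op may still accept network IO; all earlier ones are sealed.
struct WriteOp {
	bool sealed = false;          // WRITE_END seen before all data hit disk
	int pendingBufferedJobs = 0;  // buffered blocks not yet synced

	// Sealing stops network IO; the op lingers until buffers are synced.
	void seal() { sealed = true; }
	bool finished() const { return sealed && pendingBufferedJobs == 0; }
};

struct Entry {
	std::list<std::unique_ptr<WriteOp>> writeOps;

	// A new op seals the previous tail, preserving the invariant that
	// every op except the last one is sealed.
	WriteOp &startWrite() {
		if (!writeOps.empty()) { writeOps.back()->seal(); }
		writeOps.push_back(std::make_unique<WriteOp>());
		return *writeOps.back();
	}

	// Analogue of clearCompletedWriteHLO: drop fully synced sealed ops.
	void clearCompleted() {
		writeOps.remove_if(
		    [](const std::unique_ptr<WriteOp> &op) { return op->finished(); });
	}
};
```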

Some write operations in the client assume a previous block has already
been written and attempt to read it back to rebuild the parity parts.
Those blocks may not actually be on disk yet, so such reads need to be
patched with the data allegedly synced to the drives, i.e. the already
buffered data.
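The patching step can be sketched as follows, assuming a simple map of buffered blocks keyed by block index; the block size, the map representation, and the function name are invented for illustration (the real code works on InputBuffer/OutputBuffer instances).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

constexpr size_t kBlockSize = 4;  // tiny block size, for illustration only

// Hypothetical store of already-replied-but-unsynced blocks, keyed by
// block index within the chunk.
using BufferedBlocks = std::map<size_t, std::vector<uint8_t>>;

// Sketch of the patching step: after reading from disk, overlay every
// block that was acknowledged to the client but may not be on disk yet.
void patchReadWithBuffered(std::vector<uint8_t> &readData, size_t firstBlock,
                           const BufferedBlocks &buffered) {
	size_t blockCount = readData.size() / kBlockSize;
	for (size_t i = 0; i < blockCount; ++i) {
		auto it = buffered.find(firstBlock + i);
		if (it == buffered.end()) { continue; }
		std::copy(it->second.begin(), it->second.end(),
		          readData.begin() + i * kBlockSize);
	}
}
```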

This PR also fixes the CS chunk locks on master disconnect, because the
current sequence of operations after the master connection is killed,
either by a network error or an actual master crash/unexpected stop,
is flawed. The issue is the following:

  • a write error occurs (the status passed in the write end status) on
    a locked chunk while the master is disconnected.
  • after the master comes back, the chunk is registered as if it were a
    good one, while it may contain broken data.

The proposed solution is to change the callback of those lock jobs so
that the chunk parts with broken data get deleted and unregistered in
the master once the master is back.
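A rough sketch of that callback swap, with invented types standing in for the real JobPool machinery (the actual changeLockJobsCallback works with the chunkserver's job pool and a callback-maker type):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Each pending lock job carries a completion callback, keyed here by
// chunk id for illustration.
using LockCallback = std::function<void(uint64_t chunkId, int status)>;

struct LockJobs {
	std::map<uint64_t, LockCallback> jobs;

	// Analogue of changeLockJobsCallback: on master disconnect, replace
	// every pending callback so that a failed write on a locked chunk is
	// handled by the error path once the master is back.
	void changeCallbacks(const LockCallback &replacement) {
		for (auto &entry : jobs) { entry.second = replacement; }
	}
};
```

Usage: after a disconnect, the replacement callback would delete the broken part and unregister it from the master instead of reporting it as good.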

Side changes:

  • some tests needed patching to make them pass, such as the ones
    testing the CREATE_EMPTY_FOLDERS_WHEN_SPACE_DEPLETED option.
  • created a loadReloadableSettings function to avoid code duplication.
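A hedged sketch of what such a shared loader might look like; the config representation, option name lookup, and default value here are assumptions, since the PR only states that the function removes duplicated loading code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative container for settings that may be changed at runtime.
struct ReloadableSettings {
	int64_t writeBufferingSizeMB = 256;  // assumed default
};

// Sketch of a single loader shared between startup and reload paths,
// so the WRITE_BUFFERING_SIZE_MB parsing exists in one place only.
ReloadableSettings loadReloadableSettings(
    const std::map<std::string, int64_t> &cfg) {
	ReloadableSettings s;
	auto it = cfg.find("WRITE_BUFFERING_SIZE_MB");
	if (it != cfg.end()) { s.writeBufferingSizeMB = it->second; }
	return s;
}
```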

Related to LS-315.

To the reviewers: some conversations with Gemini, as well as its summary, could prove useful while reviewing.

Signed-off-by: Dave dave@leil.io

@dmga44 dmga44 self-assigned this Feb 16, 2026
@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances write performance and consistency in the system by introducing write buffering and chunkserver-side chunk locking. These changes allow chunkservers to process write operations more efficiently, provide faster acknowledgments to clients, and enable the master to maintain a more accurate state of chunk parts being written. The refactoring of write handling logic and updates to communication protocols lay the groundwork for more robust and performant distributed storage operations.

Highlights

  • Write Buffering Implementation: Introduced write buffering capabilities on the chunkserver side, allowing for asynchronous write acknowledgments and improved performance by enabling instant replies to clients.
  • Chunkserver-Side Chunk Locking: Implemented a new mechanism for chunkservers to manage chunk locks, ensuring consistency during write operations and enabling the master to track which chunk parts are actively being written.
  • Master-Chunkserver Protocol Enhancements: Extended the communication protocols between master and chunkservers to support new operations like CREATE_AND_LOCK_CHUNK, SET_VERSION_AND_LOCK, LOCK_CHUNK, UNLOCK_CHUNK, and DUPLICATE_AND_LOCK_CHUNK.
  • Refactored Write Handling Logic: Updated the WriteHighLevelOp and ChunkserverEntry classes to handle multiple concurrent write operations, manage input buffers more efficiently, and integrate the new chunk locking and instant reply features.
  • New Configuration Option: Added an experimental configuration option USE_CHUNKSERVER_SIDE_CHUNK_LOCK to enable or disable the new chunkserver-side locking functionality.
  • System Test Enhancements: Added new system tests to validate the write buffering and chunk locking mechanisms, including scenarios for concurrent writes and disk space management.
Changelog
  • CMakeLists.txt
    • Updated minimum version to 5.8.0.
  • doc/sfsmaster.cfg.5.adoc
    • Documented the new USE_CHUNKSERVER_SIDE_CHUNK_LOCK experimental configuration option.
  • src/admin/dump_config_command.cc
    • Added USE_CHUNKSERVER_SIDE_CHUNK_LOCK to the default master options.
  • src/chunkserver/bgjobs.cc
    • Added addJobIfNotLocked and addLockJob methods to manage job queuing based on chunk lock status.
    • Implemented chunk locking and unlocking logic within the JobPool.
  • src/chunkserver/bgjobs.h
    • Declared new AddJobFunc type, addJobIfNotLocked, addLockJob, startChunkLock, enforceChunkLock, releaseChunkLockEntry, endChunkLock, eraseChunkLock methods.
    • Introduced LockedChunkData struct to hold information about locked chunks.
  • src/chunkserver/bgjobs_unittest.cc
    • Added a new test case to verify chunk locking functionality.
  • src/chunkserver/chunk_high_level_ops.cc
    • Implemented trySeal and tryInstantReply for write buffering and immediate client responses.
    • Updated delayedClose and cleanup logic to correctly handle buffered writes and chunk locks.
    • Integrated chunk locking enforcement into the setup method.
  • src/chunkserver/chunk_high_level_ops.h
    • Added new members and methods to WriteHighLevelOp for write buffering and chunk locking.
    • Exposed parentPendingWriteJobs for managing pending write jobs.
  • src/chunkserver/chunkserver_entry.cc
    • Refactored WriteHighLevelOp to support multiple instances using a list.
    • Integrated tryInstantReply and clearCompletedWriteHLO for efficient write processing.
    • Updated write handling methods to interact with the new WriteHighLevelOp list.
  • src/chunkserver/chunkserver_entry.h
    • Modified ChunkserverEntry to use a list of WriteHighLevelOp instances.
    • Added pendingWriteJobs counter for tracking active write operations.
  • src/chunkserver/hddspacemgr.cc
    • Added functions to manage gAlreadyRepliedInputBuffers for tracking buffered write data.
    • Implemented hddUpdateOutputBufferWithAlreadyRepliedInputBuffers to incorporate buffered data into read operations.
  • src/chunkserver/hddspacemgr.h
    • Declared new functions for managing already replied input buffers.
  • src/chunkserver/io_buffers.cc
    • Implemented modifyAvailableWriteBufferingBlocks and getAvailableWriteBufferingBlocks for managing write buffer capacity.
    • Added updateIntervalBlockData and updateBlockCRC to OutputBuffer for data integrity.
    • Modified InputBuffer destructor and clear method to correctly handle repliedBlocks.
  • src/chunkserver/io_buffers.h
    • Added new methods to OutputBuffer and InputBuffer for write buffering and block data management.
  • src/chunkserver/master_connection.cc
    • Added new master-chunkserver communication handlers for CREATE_AND_LOCK_CHUNK, SET_VERSION_AND_LOCK, LOCK_CHUNK, UNLOCK_CHUNK, and DUPLICATE_AND_LOCK_CHUNK.
  • src/chunkserver/master_connection.h
    • Declared new methods in MasterConn for handling chunk locking and new packet types.
  • src/chunkserver/masterconn.cc
    • Removed gDoTerminate and masterconn_wantexit for cleaner termination logic.
    • Added masterconn_get_job_pool to expose the job pool.
  • src/chunkserver/masterconn.h
    • Added masterconn_get_job_pool declaration.
  • src/chunkserver/network_main_thread.cc
    • Introduced gDoTerminate and doTerminate for graceful shutdown.
    • Integrated modifyAvailableWriteBufferingBlocks for WRITE_BUFFERING_SIZE_MB reload functionality.
  • src/chunkserver/network_main_thread.h
    • Declared doTerminate function.
  • src/chunkserver/network_worker_thread.cc
    • Integrated clearCompletedWriteHLO and tryInstantReply into servePoll for active write management.
    • Removed closeJobs call from askForTermination to streamline shutdown.
  • src/chunkserver/network_worker_thread.h
    • Added gWriteBufferingSize_mb and kDefaultWriteBufferingSize_mb for write buffering configuration.
  • src/common/event_loop.h
    • Updated documentation for eventloop_time return type to seconds.
  • src/common/saunafs_version.h
    • Added kFirstVersionWithChunkserverSideChunkLock to track feature availability.
  • src/data/sfsmaster.cfg.in
    • Added USE_CHUNKSERVER_SIDE_CHUNK_LOCK configuration option with default value 0.
  • src/master/chunks.cc
    • Added beingWritten flag to ChunkPart to indicate active write status.
    • Updated isLocked logic to consider beingWritten status.
    • Introduced new chunk operation functions (chunk_create_operation, chunk_increase_version_operation, chunk_lock_operation, chunk_duplicate_operation, chunk_truncate_operation, chunk_duplicate_and_truncate_operation).
    • Refactored chunk_multi_modify and chunk_multi_truncate to utilize the new operation functions.
    • Implemented chunk_write_end_status and chunk_got_chunklock_status for handling write completion and lock status.
  • src/master/chunks.h
    • Updated chunk_multi_modify signature.
    • Added declarations for new chunk operation status functions.
  • src/master/filesystem_operations.cc
    • Removed usedummylockid parameter from writeChunk.
    • Updated changeLog call to reflect new operation flag interpretation.
  • src/master/filesystem_operations.h
    • Removed usedummylockid parameter from writeChunk declaration.
  • src/master/filesystem_operations_interface.h
    • Removed usedummylockid parameter from writeChunk virtual function.
  • src/master/matoclserv.cc
    • Removed kUseDummyLockId and usedummylockid parameter from writeChunk calls.
  • src/master/matocsserv.cc
    • Modified matocsserv_getservers_for_new_chunk to return min_server_count.
    • Updated matocsserv_send_createchunk, matocsserv_send_setchunkversion, matocsserv_send_duplicatechunk to support needsLock and sentChunkLock parameters.
    • Added matocsserv_send_chunklock, matocsserv_got_chunklock_status, matocsserv_got_writeend_status, and matocsserv_send_chunkunlock for new locking features.
  • src/master/matocsserv.h
    • Updated matocsserv_getservers_for_new_chunk signature.
    • Added declarations for new chunk locking/unlocking functions.
  • src/master/restore.cc
    • Removed false for usedummylockid parameter in writeChunk call.
  • src/mount/writedata.cc
    • Simplified processJob by removing haveDataToWrite flag.
    • Ensured chunkserver communication always happens to release locks.
    • Adjusted error handling for SAUNAFS_ERROR_LOCKED.
  • src/protocol/SFSCommunication.h
    • Added new communication message IDs for CREATE_AND_LOCK_CHUNK, DUPLICATE_AND_LOCK_CHUNK, SET_VERSION_AND_LOCK, LOCK_CHUNK, CSTOMA_LOCK_CHUNK, CSTOMA_WRITE_END_STATUS, and UNLOCK_CHUNK.
  • src/protocol/cstoma.h
    • Defined new packet serialization for chunkLock and writeEndStatus.
  • src/protocol/cstoma_unittest.cc
    • Added unit tests for ChunkLock and WriteEndStatus packets.
  • src/protocol/matocs.h
    • Defined new packet serialization for setVersionAndLock, chunkLock, chunkUnlock, createAndLockChunk, and duplicateAndLockChunk.
  • src/protocol/matocs_unittest.cc
    • Added unit tests for SetVersionAndLock and ChunkLock packets.
  • tests/test_suites/SanityChecks/test_crc_error_fixing.sh
    • Added a sleep 1 command before damaging a chunk to ensure proper state.
  • tests/test_suites/ShortSystemTests/test_allow_empty_folders_metadata_after_full_disk.sh
    • Added WRITE_BUFFERING_SIZE_MB = 0 to chunkserver extra configuration.
  • tests/test_suites/ShortSystemTests/test_concurrent_random_writes_on_chunk.sh
    • Increased timeout_set and times_to_repeat for more extensive testing.
    • Added a background thread to dynamically switch USE_CHUNKSERVER_SIDE_CHUNK_LOCK during the test run.
  • tests/test_suites/ShortSystemTests/test_disable_empty_folders_metadata_after_full_disk.sh
    • Added WRITE_BUFFERING_SIZE_MB = 0 to chunkserver extra configuration.
  • tests/test_suites/ShortSystemTests/test_ec_truncate_atomicity.sh
    • Added a sleep 2 command to allow chunkservers to flush their data.
  • tests/test_suites/ShortSystemTests/test_wb_allow_empty_folders_metadata_after_full_disk.sh
    • Added a new system test for write buffering with allowed empty folder creation after disk full.
  • tests/test_suites/ShortSystemTests/test_wb_disable_empty_folders_metadata_after_full_disk.sh
    • Added a new system test for write buffering with disabled empty folder creation after disk full.
  • tests/tools/saunafs.sh
    • Added USE_CHUNKSERVER_SIDE_CHUNK_LOCK = 1 to the default master configuration.
    • Introduced add_lines_master_cfg_ function for dynamic configuration updates.
    • Modified sfschunkserver_check_no_buffer_in_use to reload master configuration with specific settings.
Activity
  • The pull request is currently marked as 'Work In Progress' (WIP) by dmga44, indicating ongoing development.
  • No other human activity such as comments or reviews has been recorded yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance enhancement through chunkserver-side write buffering and instant replies, underpinned by a new chunk locking mechanism. The changes are extensive, touching the master, chunkserver, and protocol layers. The core idea is to allow chunkservers to acknowledge writes to the client before the data is persisted to disk, which is made safe by locking the chunk on the master to prevent conflicting operations. This required substantial modifications to WriteHighLevelOp, ChunkserverEntry, and JobPool. The read path has also been updated to correctly serve data from these in-memory write buffers. A major and very welcome part of this PR is the significant refactoring of chunk_multi_modify in chunks.cc into smaller, more understandable functions, which greatly improves code clarity and maintainability. The changes appear well-thought-out, especially regarding consistency and error handling in this new asynchronous model. I have one suggestion for a minor improvement regarding type safety in job functions.

@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch 7 times, most recently from 3471071 to 209e321 Compare February 23, 2026 10:31
@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch 7 times, most recently from 2ad5155 to 3394f92 Compare March 3, 2026 15:37
@dmga44 dmga44 requested a review from Copilot March 3, 2026 15:37
@dmga44
Collaborator Author

dmga44 commented Mar 3, 2026

@gemini-code-assist can you please review this PR?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a write buffering mechanism to improve write performance by allowing instant replies to clients for non-standard slices, provided buffering capacity is available. Key changes include refactoring WriteHighLevelOp to manage multiple concurrent write operations per ChunkserverEntry using a list, and adding a tryInstantReply() method. A new MasterJobPool::changeLockJobsCallback function and related masterconn utilities were added to handle master server disconnections by updating lock job callbacks and reporting lost chunks if errors occur. The hddspacemgr now includes mechanisms (hddInsertAlreadyRepliedInputBuffer, hddRemoveAlreadyRepliedInputBuffer, hddUpdateOutputBufferWithAlreadyRepliedInputBuffers) to store and retrieve already replied write data, enabling reads to be patched from these buffers before data is flushed to disk. The io_buffers module was updated with an atomic counter (gAvailableWriteBufferingBlocks) to manage total write buffering capacity, and InputBuffer now tracks repliedBlocks. Configuration for WRITE_BUFFERING_SIZE_MB was added, and several system tests were updated or created to validate the new functionality. Review comments highlighted a potential race condition in inputBuffer->writeInfo_ due to concurrent modification and access, suggested optimizing a range-based for loop in WriteHighLevelOp::trySeal() by using a const reference, recommended adding comments to clarify complex logic in hddUpdateOutputBufferWithAlreadyRepliedInputBuffers, and pointed out duplicated code for WRITE_BUFFERING_SIZE_MB configuration loading.


Copilot AI left a comment


Pull request overview

This PR introduces chunkserver-side write buffering (configurable via WRITE_BUFFERING_SIZE_MB) with logic to allow early client replies and to patch subsequent reads with buffered-but-not-yet-flushed data. It also extends the test suite to cover full-disk metadata behavior with write buffering enabled/disabled.

Changes:

  • Add write-buffering capacity tracking and config reload/init wiring for WRITE_BUFFERING_SIZE_MB.
  • Implement “instant reply” / buffered-write flow in chunkserver write high-level ops and patch reads using already-replied write buffers.
  • Add/adjust system tests to validate behavior when disk space is depleted with write buffering on/off.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

File Description
tests/tools/saunafs.sh Sets a default WRITE_BUFFERING_SIZE_MB in generated chunkserver configs for tests.
tests/test_suites/ShortSystemTests/test_wb_disable_empty_folders_metadata_after_full_disk.sh New test: full-disk behavior with WB enabled and empty-folder creation disabled.
tests/test_suites/ShortSystemTests/test_wb_allow_empty_folders_metadata_after_full_disk.sh New test: full-disk behavior with WB enabled and empty-folder creation allowed.
tests/test_suites/ShortSystemTests/test_disable_empty_folders_metadata_after_full_disk.sh Updates test to explicitly disable WB.
tests/test_suites/ShortSystemTests/test_allow_empty_folders_metadata_after_full_disk.sh Updates test to explicitly disable WB.
tests/test_suites/GaneshaTests/test_nfs_ganesha_disable_empty_folders_metadata_after_full_disk.sh Updates NFS-Ganesha test to explicitly disable WB.
tests/test_suites/GaneshaTests/test_nfs_ganesha_allow_empty_folders_metadata_after_full_disk.sh Updates NFS-Ganesha test to explicitly disable WB.
src/chunkserver/network_worker_thread.h Adds global/default for write-buffering size config.
src/chunkserver/network_worker_thread.cc Calls per-loop write maintenance (everyLoopUpdateWrite).
src/chunkserver/network_main_thread.cc Loads/reloads WRITE_BUFFERING_SIZE_MB and updates available buffering blocks.
src/chunkserver/masterconn.cc Adjusts lock-job callbacks on master disconnect; reports lost chunks after delete jobs.
src/chunkserver/io_buffers.h Adds buffer patching helpers, write-buffering block accounting APIs, and new InputBuffer accessors.
src/chunkserver/io_buffers.cc Implements write-buffering block accounting + new buffer patching helpers.
src/chunkserver/hddspacemgr.h Exposes lost-chunk reporting + insert/remove already-replied write buffers for read patching.
src/chunkserver/hddspacemgr.cc Stores already-replied write buffers and patches reads from them before returning to client.
src/chunkserver/chunkserver_entry.h Tracks pending write jobs, supports multiple concurrent write HLOs, adds per-loop write maintenance hook.
src/chunkserver/chunkserver_entry.cc Switches to a list of write HLOs, adds lifecycle/close handling, and integrates instant-reply updates.
src/chunkserver/chunk_high_level_ops.h Adds sealing/instant-reply APIs and tracks delayed-close/sealed state.
src/chunkserver/chunk_high_level_ops.cc Implements sealing + instant replies + delayed-close behavior and read-patching integration.
src/chunkserver/bgjobs.h Adds a callback-maker type and API to change lock-job callbacks.
src/chunkserver/bgjobs.cc Implements changeLockJobsCallback to update callbacks for lock jobs on disconnect.

@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch from 3394f92 to b66c491 Compare March 4, 2026 16:05
@dmga44 dmga44 requested a review from Copilot March 4, 2026 16:30
@dmga44
Collaborator Author

dmga44 commented Mar 4, 2026

@gemini-code-assist
Please review again.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an experimental write buffering feature in the chunkserver to improve write performance, especially for small files. However, it also introduces several critical and high-severity security vulnerabilities in the chunkserver's write handling logic. These include a process crash when receiving out-of-order packets, resource exhaustion through unbounded object creation, an information leak due to insufficient validation of untrusted packet fields, and a logic error in the new write buffering mechanism that can lead to data corruption for unaligned read requests. Furthermore, enabling this experimental feature by default across the entire test suite could introduce instability in existing tests. These issues must be addressed before the changes are merged.


Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 8 comments.



@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch from b66c491 to 001f505 Compare March 5, 2026 02:53
@dmga44 dmga44 requested a review from Copilot March 5, 2026 02:58
@dmga44
Collaborator Author

dmga44 commented Mar 5, 2026

@gemini-code-assist
Please review again.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance enhancement with the new write buffering feature in the chunkserver. While the implementation aims for robustness and includes a valuable fix for chunk lock handling, it unfortunately introduces several critical and high-severity security vulnerabilities. Specifically, a critical race condition in the handling of buffered write information can lead to Use-After-Free (UAF) vulnerabilities and crashes, reinforced by existing guidelines on shared resource synchronization. Furthermore, logic flaws in the data patching mechanism for read operations could allow an attacker to trigger assertion failures, resulting in a Denial of Service (DoS) of the chunkserver. Additionally, a minor code refactoring suggestion has been made for better maintainability. Addressing these security vulnerabilities is paramount before merging this feature.


Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.



@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch from 001f505 to 004f1a3 Compare March 5, 2026 04:18
@dmga44 dmga44 requested a review from Copilot March 5, 2026 04:18
@dmga44
Collaborator Author

dmga44 commented Mar 5, 2026

@gemini-code-assist
Please review again.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance enhancement with the write buffering feature in the chunkserver. The implementation is comprehensive, touching upon I/O handling, state management, and error recovery. I appreciate the thoroughness, including the fix for chunk lock handling during master disconnections and the refactoring to support multiple write operations, which improves code clarity. The addition of new tests specifically for write buffering is excellent. I've found one issue related to incorrect validation of write offsets, which I've detailed in a specific comment.

Note: Security Review did not run due to the size of the PR.


Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.



@dmga44 dmga44 marked this pull request as ready for review March 5, 2026 04:33
@dmga44
Collaborator Author

dmga44 commented Mar 5, 2026

@gemini-code-assist
Can you please write relatively detailed PR description (for this PR)?

@gemini-code-assist

This pull request introduces a significant performance enhancement to the chunkserver by implementing write buffering and chunkserver-side chunk locking. The primary goal is to improve write performance and consistency, particularly by reducing client-perceived latency for write operations.

Problem Addressed:
Previously, a chunkserver would only acknowledge a write operation after the data had been fully written to disk. This "write-through" approach, while ensuring data durability, could introduce significant latency, especially for clients with high throughput requirements or when dealing with many small files. Furthermore, managing consistency during concurrent writes and master disconnections posed challenges.

Solution: Write Buffering
The core of this enhancement is the introduction of write buffering. The chunkserver can now provide a "fast reply" to clients, acknowledging write data requests as if the data were already synced to disk, even though it's only been received from the network and stored in memory (similar to a page cache). This allows clients to proceed without waiting for the physical disk write, drastically improving perceived write performance.

Key aspects of the write buffering implementation include:

  • Asynchronous Acknowledgments: Clients receive SAUNAFS_STATUS_OK much faster.
  • In-memory Buffering: Data is temporarily held in InputBuffer instances.
  • WriteHighLevelOp Refactoring: The WriteHighLevelOp class has been updated to manage these buffered writes. When a WRITE_END packet is received from the client before all buffered data is written to disk, the WriteHighLevelOp instance becomes "sealed." It then waits for the buffered data to be synced to disk before being fully cleared.
  • Concurrent Writes: The ChunkserverEntry now maintains a list of WriteHighLevelOp instances, allowing it to handle multiple concurrent write operations efficiently.
  • Read Patching: To ensure data consistency for subsequent read operations, a mechanism (hddUpdateOutputBufferWithAlreadyRepliedInputBuffers) is introduced. This patches read requests with data that has been acknowledged to the client but might still reside in the in-memory buffers, ensuring clients always see the most up-to-date data.

Constraints for Write Buffering:
For fast replies to be enabled, several conditions must be met:

  • The client must send WRITE_INIT first.
  • The initial write operation job must be processed.
  • All data for the write request must have been read from the network.
  • There must be available write buffering blocks, controlled by the WRITE_BUFFERING_SIZE_MB configuration option.
  • The chunk must be locked, requiring the master to have USE_CHUNKSERVER_SIDE_CHUNK_LOCK enabled.
  • Write buffering does not apply to standard and replica chunks, focusing on other chunk types where this optimization is more beneficial and safe.
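The block budget implied by WRITE_BUFFERING_SIZE_MB could be sketched with an atomic counter like this; the block size, conversion factor, and function names are assumptions modeled on the modifyAvailableWriteBufferingBlocks API mentioned in the changelog.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int64_t kBlocksPerMB = 16;  // assumed 64 KiB blocks

// Shared budget of blocks available for write buffering.
std::atomic<int64_t> gAvailableBlocks{0};

// Reservations and releases adjust the counter; clearing or destroying
// an input buffer would call this with a positive delta.
void modifyAvailableBlocks(int64_t delta) {
	gAvailableBlocks.fetch_add(delta, std::memory_order_relaxed);
}

int64_t availableBlocks() {
	return gAvailableBlocks.load(std::memory_order_relaxed);
}

// On config reload, adjust the budget by the difference between the old
// and new WRITE_BUFFERING_SIZE_MB, leaving in-flight reservations alone.
void applyWriteBufferingSizeMB(int64_t oldMB, int64_t newMB) {
	modifyAvailableBlocks((newMB - oldMB) * kBlocksPerMB);
}
```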

Solution: Chunkserver-Side Chunk Locking
To maintain data consistency and enable reliable write buffering, a new chunkserver-side chunk locking mechanism has been implemented. This ensures that the master can accurately track which chunk parts are actively being written, especially when acknowledgments are sent before data is persisted.

This involves:

  • Enhanced Master-Chunkserver Protocol: New communication messages have been added, such as CREATE_AND_LOCK_CHUNK, SET_VERSION_AND_LOCK, LOCK_CHUNK, UNLOCK_CHUNK, and DUPLICATE_AND_LOCK_CHUNK. These allow the master to explicitly manage chunk locks on the chunkservers.
  • Improved Master Disconnect Handling: A critical fix addresses a scenario where a master disconnects during a write error on a locked chunk. Previously, such chunks could be registered as good upon master reconnection, leading to data inconsistency. The new solution modifies the callback for lock jobs to delete and unregister chunk parts with broken data from the master if an error occurred and the master is back online. This prevents corrupted data from being considered valid.
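The disconnect fix can be pictured as a small decision function (a sketch under assumed semantics; the real callback operates on chunk-part jobs and master link state, and these names are invented):

```cpp
// Sketch of the corrected lock-job callback decision: a write error on a
// locked chunk must not let the chunk part be re-registered as good after
// a master reconnect; instead, the broken part is deleted and unregistered
// once the master is reachable again.
enum class LockJobAction {
    KeepAndRegister,      // no error: the part stays and is registered normally
    DeleteAndUnregister,  // error + master online: drop the broken part
    WaitForMaster         // error while master is away: defer the cleanup
};

LockJobAction onLockJobFinished(bool writeErrorOccurred, bool masterOnline) {
    if (!writeErrorOccurred) return LockJobAction::KeepAndRegister;
    return masterOnline ? LockJobAction::DeleteAndUnregister
                        : LockJobAction::WaitForMaster;
}
```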

Configuration:

  • A new experimental configuration option, WRITE_BUFFERING_SIZE_MB, has been added to sfschunkserver.cfg.5.adoc and src/data/sfschunkserver.cfg.in to control the maximum amount of data buffered in memory.
  • The USE_CHUNKSERVER_SIDE_CHUNK_LOCK option in the master configuration is crucial for enabling the chunk locking mechanism required by write buffering.
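For illustration only, enabling a 256 MB buffer in the chunkserver config could look like the excerpt below (the value is arbitrary; judging by the patched tests, 0 appears to disable the feature):

```ini
## sfschunkserver.cfg (excerpt)
## EXPERIMENTAL: caps the amount of write data fast-replied to clients
## while it is still only in memory, not yet synced to the drives.
WRITE_BUFFERING_SIZE_MB = 256
```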

Testing:
New system tests have been added and existing ones patched to validate the write buffering and chunk locking mechanisms. They cover scenarios such as concurrent writes and disk space management, including specific tests for WRITE_BUFFERING_SIZE_MB and USE_CHUNKSERVER_SIDE_CHUNK_LOCK.

This feature, tracked under LS-315, significantly enhances the chunkserver's ability to handle write operations efficiently and robustly.

@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch 3 times, most recently from 749e769 to 6cdce8d Compare March 6, 2026 03:58
The current sequence of operations after the master connection is killed,
whether by a network error or an actual master crash/unexpected stop, is
flawed in the following way:
- there is a write error (the status passed in the write end status) on
a locked chunk while the master is disconnected;
- after the master comes back, the chunk is registered as if it were a
good one, even though it actually holds broken data.
The proposed solution is to change the callback of those lock jobs so
that chunk parts with broken data get deleted and unregistered in the
master once the master is back.

Signed-off-by: Dave <dave@leil.io>
@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch 2 times, most recently from 5f59125 to 8a406c0 Compare March 6, 2026 15:42
This commit introduces the write buffering feature to the chunkserver.
The feature changes the chunkserver behavior so that it fast-replies to
write data requests before the data is actually on the drives, as would
normally be expected. Before a write data request is fast-replied:
- the sequence of packets sent from the client must be the usual one,
i.e. WRITE_INIT must be sent first.
- the job opened at the beginning of the write operation must have been
processed.
- all the data to be written from the write data request must have been
read from the network.
- there must be available write buffering blocks; this counter depends
on the new option WRITE_BUFFERING_SIZE_MB and is maintained when
clearing and destroying InputBuffer instances.
- the chunk must be locked: the master must have the
USE_CHUNKSERVER_SIDE_CHUNK_LOCK option enabled. This constraint is
especially important for handling write errors that may occur on chunks
supposedly written successfully.

Standard (replica) chunks are not eligible for these fast replies.

Because of the fast replies, a WriteHighLevelOp might receive a
WRITE_END packet before all the data it should write is actually
written to disk. In such cases, the instance becomes sealed:
- no more network IO happens for that write high level operation.
- it waits until the buffered data is actually synced before
disappearing. The writeFinishedCallback handles the subsequent
writes of the already pending input buffers; see also
continueWritingIfPossible.
This is why the ChunkserverEntry now holds a list of write high
level operations instead of a single one; all writeHLO instances
except the last one must be sealed.
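A toy model of this list invariant, with invented names (the real WriteHighLevelOp carries much more state), could look like:

```cpp
#include <cstddef>
#include <deque>

// Simplified stand-in for a write high level operation: it is "sealed"
// when WRITE_END arrived while buffered data was still pending.
struct WriteOp {
    bool sealed = false;
    std::size_t pendingBufferedBlocks = 0;  // data acked but not yet on disk
};

struct ChunkserverEntryModel {
    std::deque<WriteOp> writeOps;

    // A new WRITE_INIT appends a fresh, unsealed op at the end.
    void onWriteInit() { writeOps.push_back(WriteOp{}); }

    // WRITE_END for the current (last) op: seal it if buffered data is
    // still pending, otherwise it can be destroyed immediately.
    void onWriteEnd() {
        if (writeOps.empty()) return;
        if (writeOps.back().pendingBufferedBlocks > 0)
            writeOps.back().sealed = true;
        else
            writeOps.pop_back();
    }

    // Called when a buffered block of the oldest op reaches disk
    // (cf. writeFinishedCallback / continueWritingIfPossible).
    void onBlockSynced() {
        if (writeOps.empty()) return;
        WriteOp &front = writeOps.front();
        if (front.pendingBufferedBlocks > 0) --front.pendingBufferedBlocks;
        if (front.sealed && front.pendingBufferedBlocks == 0)
            writeOps.pop_front();  // sealed op fully flushed, disappears
    }

    // Invariant from the commit message: every op except the last is sealed.
    bool invariantHolds() const {
        for (std::size_t i = 0; i + 1 < writeOps.size(); ++i)
            if (!writeOps[i].sealed) return false;
        return true;
    }
};
```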

Some write operations in the client assume a previous block has been
written and attempt to read it back to rebuild the parity parts. Those
blocks may not actually be on disk yet, so such reads need to be
patched with the data allegedly synced to the drives, i.e. the already
buffered data.
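The patching idea can be sketched as an overlay of acked-but-unsynced byte ranges onto a block freshly read from disk (a simplified stand-in for hddUpdateOutputBufferWithAlreadyRepliedInputBuffers; the structures are invented):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One byte range that was already acknowledged to the client but still
// lives only in the chunkserver's memory.
struct BufferedRange {
    std::size_t offset;                 // offset inside the output block
    std::vector<std::uint8_t> data;     // acked-but-unsynced bytes
};

// Overlay every pending range onto the block read from disk, so readers
// always see the newest (acknowledged) data.
void patchOutputBuffer(std::vector<std::uint8_t> &outputBlock,
                       const std::vector<BufferedRange> &pending) {
    for (const BufferedRange &r : pending) {
        if (r.offset >= outputBlock.size()) continue;  // out of range, skip
        std::size_t n =
            std::min(r.data.size(), outputBlock.size() - r.offset);
        std::memcpy(outputBlock.data() + r.offset, r.data.data(), n);
    }
}
```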

Side changes:
- some tests needed patching to make them pass, such as the ones
testing the option CREATE_EMPTY_FOLDERS_WHEN_SPACE_DEPLETED.
- created loadReloadableSettings function to avoid code duplication.

Signed-off-by: Dave <dave@leil.io>
@dmga44 dmga44 force-pushed the perf-cs-write-buffering branch from 8a406c0 to 9338ff1 Compare March 6, 2026 16:20
Contributor @lgsilva3087 left a comment:

Pass 1

for f in test/*; do
assert_eventually_prints "" "saunafs fileinfo '$f' | grep ':${info[chunkserver0_port]}'"
assert_eventually_prints 2 "saunafs fileinfo '$f' | grep copy | wc -l"
# Assert that data is replicated to chunkservers 1, 2 and no chunk is stored on cs 0
And 3?

## Caps the amount of data in MB the chunkserver "buffers", i.e fast replies to the
## client as if it was synced to the drives but it was only read from network and
## stored in memory. Similar to page cache. This feature has the following constraints:
## - does not work on standard and replica chunks

Consider:
standard (replica) chunks

 USE_RAMDISK=YES \
-CHUNKSERVER_EXTRA_CONFIG="READ_AHEAD_KB = 1024|MAX_READ_BEHIND_KB = 2048" \
+CHUNKSERVER_EXTRA_CONFIG="READ_AHEAD_KB = 1024|MAX_READ_BEHIND_KB = 2048 \
+|WRITE_BUFFERING_SIZE_MB = 0" \

Why was this needed here?

starving the other.
Not reloadable. (Default: FIFO)

*WRITE_BUFFERING_SIZE_MB (EXPERIMENTAL)*:: Caps the amount of data in MB the chunkserver

MiB or MB?

 hddAddErrorAndPreserveErrno(chunk);
 safs::log_warn("{}: file:{} - write error", errorMsg,
-               chunk->fullMetaFilename().c_str());
+               chunk->fullDataFilename().c_str());

The .c_str() should not be needed anymore with the new logging functions.

3 participants